Improving performance of automated scoring through detection of outliers and understanding model instabilities
Abstract
Automated scoring of writing operates under the assumption that the quality of a student essay can be characterized by a model of weighted features that are extracted from the essay. Because the automated scoring model is trained on a representative sample of essays for a prompt, each feature has an expected range of acceptable values based on its distribution in the training set. However, if the value of a particular feature, or the combined values of a group of features, extends beyond the training range, the assumptions of the scoring model may be violated and instabilities may arise in the model. This issue is often compounded by the use of features that are not normally distributed or that exhibit a nonlinear relationship to scores. This paper describes a series of analyses that examine the impact on scoring of essays at or beyond the boundaries exhibited in the training set. It further illustrates how these approaches contribute to developing scoring confidence measures and to detecting outlier essays that could reflect construct-irrelevant responses. Such scoring confidence measures can be integrated into a combined human/computer-based scoring scheme in which essays with low computer-scoring confidence are passed on for human judgment.

Improving performance of automated scoring through detection of outliers and understanding model instabilities

Automated scoring of essays offers a means to judge the quality of student essays with rapid feedback to teachers, students, and decision-makers. While automation allows nearly immediate and often cost-effective scoring, it is critical to ensure that the techniques used also produce accurate, reliable, and valid scoring for the set of essays that will be input to the system (e.g., Williamson et al., 2010). However, designers of automated techniques cannot always anticipate the full diversity of inputs they will receive. The models may be asked to score essays that are highly unusual or not representative of the expected input. Thus, it becomes incumbent on developers of scoring systems to ensure not only that these systems are able to accurately score a wide range of input essays, but also that they are able to judge the potential for any incoming essay not to be scored accurately.

Techniques for automated scoring of writing typically measure the quality of an essay by extracting a set of features from the essay and then combining the features using a statistical model that is based on a training set of essays. In developing a scoring system, certain assumptions are made about the distribution of the essays used for training the system as well as about the statistical characteristics of the features and the modeling techniques used. With careful consideration of the assumptions being made, it is possible to analyze information from essay features and modeling techniques to develop methods that provide levels of confidence that an essay can be scored accurately. In addition, these approaches may permit a better understanding of how different essay features contribute to an essay's score and how the features may operate under different modeling assumptions that can affect scoring. In this paper we provide a series of analyses that examine the impact of essay features and model assumptions on scoring and describe how these can be used to improve the detection of essays that cannot be scored accurately.
Outlier detection considered within the automated scoring process

Throughout the development of automated scoring systems and of any item-specific automated scoring model, there are a number of stages that influence the reliability of the scoring and help determine the suitability of a given essay for automated scoring. These stages include:

1. The collection of essays used to train the scoring system.
2. The collection of scores from human raters for the training essays.
3. The creation and testing of features and algorithms that most reliably detect components of student writing quality and knowledge.
4. The use of methods that detect essays that may be scored less reliably by the automated scoring methods in the implemented system.

While we have previously focused on how these stages impact reliability (see Foltz et al., 2012), here we consider how aspects of the scoring process impact the detection of outlier essays that may be scored less accurately.

The collection of essays used for the training set provides some boundaries for determining what will be considered appropriate input for scoring. Essays used in the training set should reflect the distributional properties of the expected essays, and generally the scores for the training essays should follow a normal distribution, while ensuring that there are sufficient examples (at least 10-20) at each score point (see Foltz et al., 2012). In addition, the training essays should be representative of the expected set of essays to be scored, including considerations of the topics covered, the length, and the type of language used in the essays, and they should be sampled appropriately from the expected student population with respect to gender, ethnicity, and skill level (e.g., Williamson et al., 2010). This training set therefore provides a characterization of the expected range of essays that will be received for scoring.

The features used to analyze the essays provide a means to ascertain how well any essay falls within the distributional confines of the training set. These features can include measures of aspects of the writing such as the caliber of the student's expression and organization of words and sentences, the student's knowledge of the domain content, the quality of the student's reasoning, and the student's skills in language use, grammar, and the mechanics of writing. Generally, these features need to be construct relevant, provide effective measures of the target skill, and be well represented in the expected population of essays to be scored. Building a scoring model based on these features makes the implicit assumptions that the features will be represented across the spectrum of essays and that they will behave in a sufficiently regular manner to allow successful modeling of their behavior.

The algorithms, or modeling techniques, similarly make assumptions about the features that will be used. For example, linear regression-based approaches posit that features behave in a linear fashion within the range of expected essays, even when they are considered within a multivariate context. Such approaches also tend to assume that features have normal distributions; however, many language features have non-normal distributions, such as the Zipf distribution of word frequency. Nonlinear and Bayesian-based approaches make analogous types of assumptions within the constraints of the range of their training sets.
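As a minimal illustration of the training-set checks implied by the first two stages above (a sketch under assumed data layout and column names, not the operational procedure used for the IEA), the following Python code verifies that a hypothetical training set contains enough essays at each score point and records the observed range of every feature, which later defines what counts as "within the training range" for an incoming essay.

```python
# Minimal sketch (assumed data layout, not the IEA's actual pipeline):
# sanity-check a training set of scored essays before building a model.
# Assumes a pandas DataFrame with a human "score" column and numeric
# feature columns; both names are illustrative assumptions.
import pandas as pd

MIN_PER_SCORE_POINT = 10  # lower end of the 10-20 examples suggested above

def check_training_set(train: pd.DataFrame, score_col: str = "score"):
    """Return score-point counts, under-represented score points, and
    the observed range of every feature in the training set."""
    counts = train[score_col].value_counts().sort_index()
    sparse_points = counts[counts < MIN_PER_SCORE_POINT]

    # The min/max/mean/std of each feature summarize the training range
    # against which new essays can later be compared.
    features = train.drop(columns=[score_col])
    feature_ranges = features.agg(["min", "max", "mean", "std"]).T
    return counts, sparse_points, feature_ranges

# Hypothetical usage:
# counts, sparse, ranges = check_training_set(pd.read_csv("training_essays.csv"))
```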
In each of these cases, it is assumed that variables falling within the expected range will provide stable estimates. However, if variables fall outside of that range, it is not always clear whether the algorithms will provide stable estimates, and scoring accuracy may degrade. By analyzing the features of the essays from the training set, we can therefore derive an expected range of essay features. The system can then determine whether the value of a particular feature, or the combined values of a group of features, lies beyond the training range. Values that fall outside the confidence interval of this range may indicate potential violations of the assumptions of the scoring model and may cause instabilities in the model.

Below we examine approaches to using these assumptions to detect outlier essays and essays that fall in a part of the feature space where the model may be less stable. We illustrate three approaches to the detection of outliers and to the effects of model assumptions across the values of different scoring features. All of the presented analyses were performed using data from a single, advanced high-school/entry-level college prompt with 559 essays. The median essay length for the prompt was 389 words with an interquartile range of 173, and the median sentence count was 19 with an interquartile range of 9. The approach generalizes to other prompts and essay sets, but it is illustrated through analysis of features of a single prompt for concreteness and simplicity. Not all of the phenomena discussed are evidenced in every prompt, and it is important to note that while this paper provides examples of a variety of features used to measure performance in essays, not all of the features described are used operationally in the Intelligent Essay Assessor (IEA) for scoring. Thus, the focus of the paper is on illustrating how such generalized approaches can be used across a range of different types of features to detect essays that deviate from the norm, rather than on any specific instantiation used in the IEA.

Multivariate normality as a test for outliers

An essay that may be an outlier can be classified by its values on a single feature or by the conjoint values across multiple features. Indeed, in the case of multiple features, the value of each individual feature may appear within an acceptable range, but from a multivariate perspective the combination of features may fall in a part of the feature space where there is little or no data from the training set. In such areas of sparsity, it is less certain that the values of the features will combine in a way that is representative of the best estimate of the score for the essay. It is often convenient to group the features describing responses into functional sets, which allows harnessing the covariance structure of each set to guide an evaluation of how far a given response is from the responses in the training set. Comparing a response's functional set to those from the training set provides a measure of confidence that the response is sufficiently represented by the responses within the training set to be accurately scored, and conversely signals responses inappropriate for a given scoring model. A distance measure within feature space, such as a generalized distance, is required to allow these types of comparisons.
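As a sketch of how such a generalized distance can be used in practice (an illustration under the assumption of a numeric feature matrix, not the specific implementation used in the IEA), the following Python code estimates the training-set mean and covariance, computes squared Mahalanobis distances for incoming essays, and flags those beyond a chi-squared-based threshold. As discussed next, that threshold rests on a multivariate normality assumption that is often violated, so any such flag should be treated as conservative.

```python
# Sketch of flagging essays whose feature combination falls far from the
# training set, using the Mahalanobis (generalized) distance. The
# chi-squared threshold assumes approximate multivariate normality.
import numpy as np
from scipy.stats import chi2

def fit_reference(train_features: np.ndarray):
    """Estimate mean and inverse covariance from the training feature matrix."""
    mu = train_features.mean(axis=0)
    cov = np.cov(train_features, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against singularity
    return mu, cov_inv

def mahalanobis_d2(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray):
    """Squared generalized distance of each row of x from the training centroid."""
    diff = np.atleast_2d(x) - mu
    return np.einsum("ni,ij,nj->n", diff, cov_inv, diff)

def flag_outliers(new_features, mu, cov_inv, alpha=0.01):
    """Flag essays whose squared distance exceeds the chi-squared quantile."""
    d2 = mahalanobis_d2(new_features, mu, cov_inv)
    threshold = chi2.ppf(1 - alpha, df=mu.shape[0])
    return d2 > threshold, d2
```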
While generalized distance measures, such as the Mahalanobis distance (Mahalanobis, 1936), do not require distributional assumptions, the interpretation of the resulting distances (i.e., how large a separation should raise concern) is more straightforward if the underlying distribution is multivariate normal. This motivates assessing the degree to which our feature sets are multivariate normal. For many of our feature sets, this assumption is violated, with long tails being the most common deviation from normality.

Many statistics are available to help judge multivariate normality, such as the Cox-Small test (Cox & Small, 1978), though for a better understanding of the underlying phenomena we tend to favor visualization. The asymptotic chi-squared distribution of the Mahalanobis distances of data generated under multivariate normality allows us to follow Healy (1968) in producing and examining the multivariate normal plot. This type of plot may feel familiar, since it is quite similar to the quantile-quantile plots that are commonly used in comparing distributions of univariate data. The following plots show the generalized distances from the same set of training responses while varying the feature sets. The distances are plotted against the quantiles of the chi-squared distribution. Deviations from the 45-degree line indicate departure from multivariate normality. The plots also include a bootstrap 95% confidence interval, indicated in red, to help guide interpretation.

The two plots in Figure 1 provide a baseline example of the behavior of generalized distance plots. The left plot shows what a multivariate normal plot looks like when the data are drawn from a 20-dimensional multivariate normal distribution. This pattern is contrasted in the right plot, where the data are drawn from a multivariate t distribution with 10 degrees of freedom, which is slightly overdispersed relative to the normal. Clearly, the 95% confidence interval excludes the diagonal for much of the range in the right plot. This illustrates the basis for detecting deviations from multivariate normality.

Figure 1a and 1b: Multivariate normal plots. Figure 1a uses random samples from a multivariate normal distribution; Figure 1b uses samples drawn from a multivariate t distribution with 10 degrees of freedom. Note that the samples in 1b cause deviations of the confidence interval from the 45-degree line.

The next three plots show examples taken from feature sets that are relevant for scoring: 1) coherence features, which measure aspects of the flow across sentences as well as how well each sentence contributes to the coherence of the overall essay (see Foltz, Kintsch & Landauer, 1998); 2) features derived from statistical language models based on n-grams (e.g., Jurafsky & Martin, 2009); and 3) a set of readability features. In all cases except the readability features, multivariate normality is violated.

Figure 2: Multivariate normal plot of coherence features.

Figure 3: Multivariate normal plot of statistical language features.

Figure 4: Multivariate normal plot of readability features.
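The sketch below illustrates how a Healy-style multivariate normal plot of this general kind might be produced. It uses a parametric simulation envelope under exact multivariate normality rather than the bootstrap interval shown in Figures 1-4, and the function and variable names are assumptions made for the example rather than the code behind the reported figures.

```python
# Sketch of a Healy-style multivariate normal plot: ordered squared
# Mahalanobis distances of the training responses plotted against
# chi-squared quantiles, with a simulated 95% envelope under normality.
# Assumes `features` is an (n_responses, n_features) numeric array.
import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt

def healy_plot(features: np.ndarray, n_sim: int = 500, seed: int = 0):
    n, p = features.shape
    mu = features.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(features, rowvar=False))
    diff = features - mu
    d2 = np.sort(np.einsum("ni,ij,nj->n", diff, cov_inv, diff))

    # Theoretical chi-squared quantiles at standard plotting positions.
    probs = (np.arange(1, n + 1) - 0.5) / n
    q = chi2.ppf(probs, df=p)

    # Simulated envelope: ordered distances from multivariate normal
    # samples of the same size and dimension, re-estimating parameters
    # each time so the envelope reflects sampling variability.
    rng = np.random.default_rng(seed)
    sims = np.empty((n_sim, n))
    for s in range(n_sim):
        z = rng.standard_normal((n, p))
        zc = z - z.mean(axis=0)
        ci = np.linalg.pinv(np.cov(z, rowvar=False))
        sims[s] = np.sort(np.einsum("ni,ij,nj->n", zc, ci, zc))
    lo, hi = np.percentile(sims, [2.5, 97.5], axis=0)

    plt.plot(q, d2, "k.", label="observed distances")
    plt.plot(q, q, "g-", label="45-degree line")
    plt.fill_between(q, lo, hi, color="red", alpha=0.3, label="95% envelope")
    plt.xlabel("Chi-squared quantiles")
    plt.ylabel("Ordered squared Mahalanobis distances")
    plt.legend()
    plt.show()
```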
At least two lessons can be drawn from these examples. The first is that confidence intervals for responses on these feature sets, based on a multivariate normality assumption, are going to be conservative: a significant number of responses that almost certainly could have been accurately scored will be falsely identified as outliers, which has a cost impact on automated scoring if the identified responses are then sent to human scorers when automated scoring would in fact have been sufficient. The second lesson is that moving away from conservative criteria requires a larger set of responses to allow more accurate estimates of the tails of these distributions.

Nonlinearities in features

From a theoretical perspective, it is not unexpected that, empirically, the relationships between human scores and many of the features characterizing the responses are often not linear. Fortunately, a linear approximation will often fit quite well over much of the range of these features. It is generally only at the extremes that we see notable deviation from linearity, though these cases commonly cause the most consternation. Beyond deviations at the extremes, the next most frequent nonlinearity arises in features that have a delimited range of optimal values, with human scores decreasing on both sides away from the optimal values in an inverted "u"-shaped curve. An example of this second class of nonlinearity is measures of coherence. As measured, for example, by the semantic similarity between sentences, we expect that both incoherent (low coherence) and highly redundant (high coherence) responses will receive lower human scores, implying that this type of relationship will not be adequately described by a linear model. A final example of nonlinearity is that many features reach asymptotic values, where, for example, appropriate use of advanced punctuation improves the quality of writing up to a point, but its contribution levels out or may even decline after that point.

The following plots demonstrate some of these issues. The distribution of student responses in feature space is represented by a kernel density smoothing of the points, where the low- to high-density regions of responses range from white through darkening shades of blue, with the densest regions nearly black. The linear regression line is shown in green and a locally weighted regression curve (Loader, 1999, 2012) is shown in red.

The first plot shows how human score varies with sentence length, measured as words per sentence (WPS). The density plot clearly shows that most responses fall in an area bounded between about 15 and 22 words per sentence, with scores ranging from .25 to about .75 for this item. For reference, the median response is 20.2 WPS, with a lower quartile of 17.3 and an upper quartile of 24.0. We also see that WPS is not a particularly informative feature, as indicated by the shallow slope of the regression line. The red locally weighted regression curve indicates that a linear fit is a reasonable approximation for a substantial portion of the WPS range. However, at the low end of WPS the linear model does not match human scoring, in that it awards scores that are too high, and at the upper end of WPS it again overestimates the contribution of WPS to human score. We see from the locally weighted regression curve that for WPS above about 25 there is no additional information on score from WPS, and in fact there may be a slight decrease in human scores at very high WPS, though this may be an artifact. This is an example of a feature reaching an asymptote, as described above.
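The kind of comparison between a linear fit and a locally weighted fit shown in these plots can be sketched as follows. This is an illustrative stand-in, not the paper's plotting code: it uses statsmodels' LOWESS smoother and a hexbin density display in place of the locfit-based smoother and kernel density shading described above, and the variable names are assumptions.

```python
# Sketch comparing a linear fit with a locally weighted regression fit for
# a single feature (e.g., words per sentence) against human score.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def plot_linear_vs_local(feature: np.ndarray, human_score: np.ndarray,
                         feature_name: str = "words per sentence"):
    # Density of responses (hexbin used here purely for simplicity).
    plt.hexbin(feature, human_score, gridsize=30, cmap="Blues", mincnt=1)

    # Linear fit (green), as in the figures.
    slope, intercept = np.polyfit(feature, human_score, deg=1)
    xs = np.linspace(feature.min(), feature.max(), 200)
    plt.plot(xs, slope * xs + intercept, "g-", label="linear fit")

    # Locally weighted regression fit (red); frac controls the smoothing span.
    smoothed = lowess(human_score, feature, frac=0.4)
    plt.plot(smoothed[:, 0], smoothed[:, 1], "r-", label="locally weighted fit")

    plt.xlabel(feature_name)
    plt.ylabel("human score")
    plt.legend()
    plt.show()
```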
While there is evidence that WPS is part of the overall evaluation humans apply to the responses, it is also clear that using WPS as a feature requires additional checking to ensure the sentences are constructed in a sensible fashion.

Figure 5: Smooth plot of human score vs. words per sentence. Density of responses ranges from white (low density) to dark blue (high density). The green line is the linear regression fit, and the red curve is the locally weighted regression fit. The grey horizontal line is a visual aid to indicate the flattening of the locally weighted regression fit.

The second plot shows how human score varies with sentence-to-sentence coherence (e.g., Foltz et al., 1998). The linear model indicates a steadily decreasing score with increasing coherence, but the locally weighted regression tells a quite different story. For responses exhibiting low coherence, there is a region where increasing coherence is recognized and rewarded by the human scorers up to a point; from that point onward, a decreasing linear model provides a reasonable approximation. An additional deviation can be seen above a coherence of about 0.5, where the humans awarded lower scores at a faster rate than a linear model would allow. Again, in the case of this coherence feature there is a large region where the linear model is a reasonable approximation, but at the low and high ends the model diverges from the evaluation of the human scorers.

Figure 6: Smooth plot of human score vs. sentence-to-sentence coherence. Density of responses ranges from white (low density) to dark blue (high density). The green line is the linear regression fit, and the red curve is the locally weighted regression fit. In the mid-range a linear model provides a reasonable approximation, but there is divergence from the human ratings at the ends.

As these two examples show, assuming a linear model can go astray, especially at feature values nearing the extremes seen in the training set. Possible solutions are using different models at the extremes or using modeling techniques that support more general functional forms.

Explanations of performance beyond length

It is well known that the length of essays for a given item often correlates highly with human scores. For instance, in the ASAP study (Shermis & Hamner, 2012), of the nine sets of human scores for eight items, two sets of human scores correlated with word count above 0.8, and all correlations were above 0.5. There are sensible reasons why length serves as a proxy for attributes of the quality of a response: adequate content coverage requires sufficient length to cover the topic, and students without much knowledge of a topic or with low language ability typically cannot generate sufficient words during a timed essay test. However, automated scoring best practice requires using more construct-relevant approaches to predicting score, with features less obviously tied to length than unadorned measures such as word count. In addition, overall length is quite an easy parameter to pad in bad-faith attempts. A difficulty in implementing a policy based on downgrading the importance of length in scoring is that many useful features are highly correlated with length. For instance, the ratio of word types to word tokens gives a measure of the diversity of vocabulary.
However, at least for shorter essays, it is also quite coupled to length. Reflecting on the mechanism of this ratio reveals that a one-word essay attains the maximum ratio of one, which can initially only decrease as the length of the essay increases, until the ratio stabilizes enough to provide useful information.

For an example case, we examine a class of semantic variables that are based on distances in the semantic space (e.g., Landauer et al., 2001). Distances in semantic space provide measures of the degree to which a target essay has content similar to the essays in the training set. Due to the nature of the distance measure, it is partially confounded with response length; for example, essays that all have similar content also tend to be of similar length. Using multidimensional scaling, we were able to generate new features based on these distances that are significantly less correlated with response length but still allow semantic similarity to explain much of the human score.

In the following pair of plots, the points are identical and only the coloring differs, with responses arrayed on two newly derived dimensions. In the left plot, the responses are colored to indicate human score, going from saturated cyan for the low scores to saturated magenta at the high end. We see that scores load quite nicely on the y-dimension. Notice the cluster of low scores at the top left; a hypothesis that is quickly validated in the right plot is that these are all very short, low-scoring essays. The right plot colors responses by decile of length in word count. We see that the first component loads heavily on length. Overall, this is the kind of result that indicates it is possible to separate length from the semantic component while preserving the ability of a derived feature to predict human score based on content. Thus, although essay length may insidiously influence many common features, it can be partialed out in a manner that allows measurements that are not influenced by the padding of extra words.

Figure 7: Responses arrayed in two derived distance dimensions. The left plot colors responses by human score, while the right plot colors by length. The derived dimensions predominantly separate out score and length.
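A rough sketch of this length-decorrelation idea is given below. It assumes a precomputed semantic distance matrix between responses (for example, from a latent semantic space) along with arrays of human scores and word counts, and it uses scikit-learn's multidimensional scaling as a generic illustration rather than the specific derivation behind the reported figures.

```python
# Sketch: derive low-dimensional coordinates from a semantic distance
# matrix via multidimensional scaling, then check how each derived
# dimension relates to response length versus human score.
import numpy as np
from sklearn.manifold import MDS

def derive_semantic_dimensions(dist: np.ndarray, human_score: np.ndarray,
                               word_count: np.ndarray, n_components: int = 2):
    """dist is an (n, n) symmetric matrix of pairwise semantic distances."""
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=0)
    coords = mds.fit_transform(dist)  # (n_responses, n_components)

    # Correlate each derived dimension with length and with human score to
    # see which dimensions mostly track length and which carry
    # construct-relevant (score-predictive) information.
    for k in range(n_components):
        r_len = np.corrcoef(coords[:, k], word_count)[0, 1]
        r_score = np.corrcoef(coords[:, k], human_score)[0, 1]
        print(f"dimension {k}: r(length) = {r_len:.2f}, r(score) = {r_score:.2f}")
    return coords
```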